First things first: I will import the necessary libraries.
import folium
import requests
import pandas as pd
import distinctipy
Since I am heading to Seattle this summer for an internship (YAY!), I decided to use a dataset containing the locations of public transit stops in King County, Washington. I will then differentiate the stops by the time their service ends for the day: for example, stops last serviced after 8:00 pm, after 10:00 pm, and so on.
The datasets and information links can be found HERE:
stops_df = pd.read_csv("data/stops.txt")
stops_df.sort_values("stop_id", inplace=True)
stops_df.reset_index(inplace=True)
stops_df.head()
| | index | stop_id | stop_code | stop_name | stop_desc | stop_lat | stop_lon | zone_id | stop_url | location_type | parent_station | stop_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1073 | 250 | 250 | 2nd Ave & Bell St | NaN | 47.613705 | -122.345253 | 21 | NaN | 0 | NaN | America/Los_Angeles |
| 1 | 1117 | 260 | 260 | 2nd Ave & Lenora St | NaN | 47.612099 | -122.342522 | 21 | NaN | 0 | NaN | America/Los_Angeles |
| 2 | 1266 | 280 | 280 | 2nd Ave & Stewart St | NaN | 47.610615 | -122.340248 | 21 | NaN | 0 | NaN | America/Los_Angeles |
| 3 | 1430 | 300 | 300 | 2nd Ave & Pike St | NaN | 47.608646 | -122.338432 | 21 | NaN | 0 | NaN | America/Los_Angeles |
| 4 | 1593 | 320 | 320 | 2nd Ave & Seneca St | NaN | 47.606213 | -122.336205 | 21 | NaN | 0 | NaN | America/Los_Angeles |
stop_times_df = pd.read_csv("data/stop_times.txt", low_memory=False)
stop_times_df.sort_values("stop_id", inplace=True)
stop_times_df.reset_index(inplace=True)
stop_times_df.head()
| | index | trip_id | arrival_time | departure_time | stop_id | stop_sequence | stop_headsign | pickup_type | drop_off_type | shape_dist_traveled | timepoint |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 618796 | 540580380 | 15:23:00 | 15:23:00 | 250 | 1 | NaN | 0 | 0 | 0.0 | 1 |
| 1 | 606487 | 540561811 | 16:47:00 | 16:47:00 | 250 | 1 | NaN | 0 | 0 | 0.0 | 1 |
| 2 | 606486 | 540561810 | 16:47:00 | 16:47:00 | 250 | 1 | NaN | 0 | 0 | 0.0 | 1 |
| 3 | 37379 | 350252621 | 08:19:00 | 08:19:00 | 250 | 1 | NaN | 0 | 0 | 0.0 | 1 |
| 4 | 618660 | 540580181 | 16:23:00 | 16:23:00 | 250 | 1 | NaN | 0 | 0 | 0.0 | 1 |
Since many of these columns hold data we do not care about, I will drop them to make things easier to follow and less crowded.
stops_df.drop(
    columns=["stop_code", "stop_desc", "zone_id", "stop_url",
             "stop_timezone", "parent_station", "location_type"],
    inplace=True,
)
stops_df.head()
| | index | stop_id | stop_name | stop_lat | stop_lon |
|---|---|---|---|---|---|
| 0 | 1073 | 250 | 2nd Ave & Bell St | 47.613705 | -122.345253 |
| 1 | 1117 | 260 | 2nd Ave & Lenora St | 47.612099 | -122.342522 |
| 2 | 1266 | 280 | 2nd Ave & Stewart St | 47.610615 | -122.340248 |
| 3 | 1430 | 300 | 2nd Ave & Pike St | 47.608646 | -122.338432 |
| 4 | 1593 | 320 | 2nd Ave & Seneca St | 47.606213 | -122.336205 |
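Now I will get the latest departure time from stop_times_df for each stop (stop_id) in stops_df, and insert that value into a new column named "latest_service_time". The times are compared as strings, which works here because they are zero-padded HH:MM:SS. GTFS writes trips that run past midnight of the service day with hours of 24 or more (for example 24:03:25), so those values sort after same-day times.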
stop_ids = stops_df["stop_id"].to_list()
latest_service_times = []
for stop in stop_ids:
    # get the service times for this stop
    times = stop_times_df.loc[stop_times_df["stop_id"] == stop, "departure_time"]
    # get the latest time
    latest_time = times.max()
    # store this time
    latest_service_times.append(latest_time)
stops_df.insert(5, "latest_service_time", latest_service_times)
stops_df
| | index | stop_id | stop_name | stop_lat | stop_lon | latest_service_time |
|---|---|---|---|---|---|---|
| 0 | 1073 | 250 | 2nd Ave & Bell St | 47.613705 | -122.345253 | 18:54:00 |
| 1 | 1117 | 260 | 2nd Ave & Lenora St | 47.612099 | -122.342522 | 18:26:46 |
| 2 | 1266 | 280 | 2nd Ave & Stewart St | 47.610615 | -122.340248 | 24:03:25 |
| 3 | 1430 | 300 | 2nd Ave & Pike St | 47.608646 | -122.338432 | 18:30:00 |
| 4 | 1593 | 320 | 2nd Ave & Seneca St | 47.606213 | -122.336205 | 24:31:00 |
| ... | ... | ... | ... | ... | ... | ... |
| 6665 | 6661 | 99755 | SE 27th St & 124th Ave SE | 47.586197 | -122.174263 | 18:57:44 |
| 6666 | 6662 | 99756 | Juanita Woodinville Way NE & NE 163rd Pl | 47.746174 | -122.184113 | 23:41:49 |
| 6667 | 6664 | 99760 | Brickyard Rd NE & NE 170th Pl | 47.751667 | -122.179375 | 23:40:33 |
| 6668 | 6668 | 99908 | Vashon Passenger Ferry & Vashon Ferry Dock | 47.510941 | -122.464722 | 18:58:00 |
| 6669 | 6669 | 99997 | Water Taxi Route & Harbor Ave SW | 47.589497 | -122.379524 | 23:00:00 |
6670 rows × 6 columns
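The loop above filters stop_times_df once per stop, which gets slow with thousands of stops. As a rough sketch of an alternative (not what this notebook ran), the same column could be built with a single groupby. The frames `toy_stops` and `toy_stop_times` are made-up stand-ins for stops_df and stop_times_df, and the sketch assumes the times are zero-padded HH:MM:SS strings so that string comparison orders them correctly.

```python
import pandas as pd

# Toy stand-ins for stops_df and stop_times_df (made-up rows, not the real feed)
toy_stops = pd.DataFrame({"stop_id": [250, 260, 280]})
toy_stop_times = pd.DataFrame({
    "stop_id": [250, 250, 260, 280, 280],
    "departure_time": ["15:23:00", "18:54:00", "18:26:46", "08:19:00", "24:03:25"],
})

# One pass over the stop times: latest departure per stop_id
latest_by_stop = toy_stop_times.groupby("stop_id")["departure_time"].max()

# Look up each stop's value by stop_id; stops with no stop times would get NaN
toy_stops["latest_service_time"] = toy_stops["stop_id"].map(latest_by_stop)
print(toy_stops)
```

Mapping by stop_id instead of relying on row order also means the result does not depend on how either frame was sorted.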
Now we need to make sure that every stop has coordinates, since that is what we care about, and handle any missing ones if necessary. The stop_id field is required according to the documentation, so there shouldn't be any missing ones, and if there are I will consider those rows invalid. I will run the same check on the latest_service_time values.
# Series.count() only counts non-null values
lat_count = stops_df["stop_lat"].count()
lon_count = stops_df["stop_lon"].count()
lst_count = stops_df["latest_service_time"].count()
print(f"There are no null coordinates: {lat_count == lon_count == len(stops_df)}")
print(f"There are no null latest service times: {lst_count == len(stops_df)}")
There are no null coordinates: True
There are no null latest service times: True
Since both checks print True, there are no null coordinates or latest service times, so there is nothing to "handle".
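Had any values been missing, one way to handle them would be to count the nulls per column and drop the affected rows. This is only a sketch on a made-up frame (`toy` and `required` are illustrative names, not part of the notebook), using the same column names as stops_df.

```python
import pandas as pd

# Made-up frame with a few deliberately missing values
toy = pd.DataFrame({
    "stop_lat": [47.61, None, 47.60],
    "stop_lon": [-122.34, -122.34, None],
    "latest_service_time": ["18:54:00", "24:03:25", None],
})

required = ["stop_lat", "stop_lon", "latest_service_time"]

# Number of missing values in each required column
null_counts = toy[required].isna().sum()
print(null_counts)

# Keep only the rows where every required column is present
clean = toy.dropna(subset=required)
print(len(clean))
```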
With the latest service time for each stop (stop_id) now stored in its own column of stops_df, we don't need stop_times_df anymore: all of the info we need from it lives in stops_df.
Using Folium, I will create a map centered on Seattle (King County), Washington, which is where the data is concentrated.
seattle_coords = [47.6062, -122.3321]
map_osm = folium.Map(location=seattle_coords, zoom_start=10)
map_osm
Here I will mark the public transit stops on the map. I will visually differentiate the stops so that stops serviced late (after 22:00, for example) get a different color than stops whose service ends earlier (before 22:00).
earliest = stops_df["latest_service_time"].min()
latest = stops_df["latest_service_time"].max()
print(f"Range of service ending times: {earliest} - {latest}")
buckets = [6, 8, 14, 18, 22, 26, 28, 30]
color_map = [
'red',
'blue',
'green',
'darkgreen',
'darkblue',
'purple',
'orange',
'darkpurple',
]
def get_color(hour):
    # buckets are interval edges: color_map[i] covers hours in [buckets[i], buckets[i+1])
    for i in range(len(buckets) - 1):
        if hour < buckets[i + 1]:
            return color_map[i]
    # hours at or past the last edge reuse the color of the last interval
    return color_map[len(buckets) - 2]

for i in range(len(stops_df)):
    stop = stops_df.iloc[i]
    lat = stop["stop_lat"]
    lon = stop["stop_lon"]
    stop_time = stop["latest_service_time"]
    # the hour is everything before the first colon, e.g. "24:03:25" -> 24
    stop_time_hr = stop_time.split(":")[0]
    stop_color = get_color(int(stop_time_hr))
    # plot marker
    folium.CircleMarker(location=[lat, lon], radius=2, color=stop_color).add_to(map_osm)
# print the meaning of the colors, matching get_color:
# color_map[i] covers hours from buckets[i] up to (not including) buckets[i+1]
for i in range(len(buckets) - 1):
    print(f"Color: {color_map[i]} = stops last serviced from {buckets[i]}:00 to before {buckets[i+1]}:00")
map_osm
Range of service ending times: 06:51:00 - 29:37:00
['red', 'blue', 'green', 'darkgreen', 'darkblue', 'purple', 'orange', 'darkpurple']
Color: red = stops last serviced from 6:00 to before 8:00
Color: blue = stops last serviced from 8:00 to before 14:00
Color: green = stops last serviced from 14:00 to before 18:00
Color: darkgreen = stops last serviced from 18:00 to before 22:00
Color: darkblue = stops last serviced from 22:00 to before 26:00
Color: purple = stops last serviced from 26:00 to before 28:00
Color: orange = stops last serviced from 28:00 to before 30:00
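For reference, here is a small standalone sketch of the bucket lookup that can be run without Folium or the dataset. It reproduces the interval logic of get_color (each color covers the hours from one bucket edge up to the next) using the standard library's bisect module. The helper names `gtfs_hour` and `bucket_color` are my own, and the sketch assumes times are "HH:MM:SS" strings whose hour may be 24 or more for service that runs past midnight.

```python
from bisect import bisect_right

buckets = [6, 8, 14, 18, 22, 26, 28, 30]
color_map = ['red', 'blue', 'green', 'darkgreen',
             'darkblue', 'purple', 'orange', 'darkpurple']

def gtfs_hour(time_str):
    """Hour part of an 'HH:MM:SS' string; can be 24 or more past midnight."""
    return int(time_str.split(":")[0])

def bucket_color(time_str):
    # bisect_right counts the edges that are <= hour, so subtracting 1
    # gives the index of the interval [buckets[i], buckets[i+1]) holding it
    idx = bisect_right(buckets, gtfs_hour(time_str)) - 1
    # clamp hours outside the edges into the first or last interval
    idx = max(0, min(idx, len(buckets) - 2))
    return color_map[idx]

for t in ["06:51:00", "18:30:00", "24:03:25", "29:37:00"]:
    print(t, bucket_color(t))
```

With eight edges there are only seven intervals, so the last entry of color_map is never returned by this lookup.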